Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals

Authors

Kang Su Gatlin

Larry Carter

Abstract

This paper explores the interplay between algorithm design and a computer’s memory hierarchy. Matrix transpose and the bit-reversal reordering are important scientific subroutines that often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds showing that, for typical memory hierarchy designs, extra data movement is unavoidable. We also prescribe the characteristics that various levels of the memory hierarchy need in order to perform efficient bit-reversals. Insight gained from our analysis leads to the design of a near-optimal bit-reversal algorithm. This Cache Optimal Bit Reverse Algorithm (COBRA) is implemented on the Digital Alpha 21164, Sun Ultrasparc 2, and IBM Power2. We show that COBRA is near-optimal with respect to execution time on these machines and performs much better than the previously best-known algorithms.

Copyright 1998 IEEE. Published in the Proceedings of HPCA 5, 9-13 January 1999 in Orlando, FL. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 732-562-3966.
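For readers unfamiliar with the bit-reversal reordering named in the abstract, the following is a minimal sketch of the straightforward out-of-place version. It is not COBRA and is not taken from the paper; the function names `bit_reverse` and `bit_reversal_permute` are hypothetical, chosen only for illustration. Element `i` of the input moves to the index obtained by reversing the binary digits of `i`, so consecutive reads map to widely strided writes. That scattered access pattern is the kind the abstract associates with cache and TLB associativity problems.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Return i with its lowest `bits` binary digits reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i onto the end of r
        i >>= 1
    return r


def bit_reversal_permute(a):
    """Naive out-of-place bit-reversal reordering (illustrative baseline only).

    Assumes len(a) is a power of two. Reads are sequential, but writes land
    at bit-reversed (large power-of-two stride) positions.
    """
    n = len(a)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    out = [None] * n
    for i in range(n):
        out[bit_reverse(i, bits)] = a[i]
    return out


if __name__ == "__main__":
    print(bit_reversal_permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Because reversing the bits twice returns the original index, applying the permutation a second time restores the input order.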

Related resources

An Efficient in-Place 3D Transpose for Multicore Processors with Software Managed Memory Hierarchy

3D transpose is an important operation in many large scale scientific applications such as seismic and medical imaging. This paper proposes a novel algorithm for fast in-place 3D transpose operation. The algorithm exploits SIMD multicore architecture with software managed memory hierarchy. Such architectural features are present in the next generation processors, such as the Cell BE processor. ...

Area-Speed-Efficient Transpose-Memory Architecture for Signal-Processing Systems

This paper presents the design and analysis of a high-speed implementation of a new transpose memory architecture. The proposed memory structure achieves almost 4X improvement in speed while consuming 46% less area, compared to prior work. For example, an 8X8 transpose memory with 12-bit input/output resolution has been implemented in 140 slices on a Virtex-7 Xilinx FPGA platform, achieving 107...

Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on...

Towards an Optimal Bit-Reversal Permutation Program

The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation – trivial operations on a RAM – present non-trivial problems when designing highly-tuned scientific library functions, particularly for the F...

A Portable 3D FFT Package for Distributed-Memory Parallel Architectures

1 Introduction. Multidimensional FFTs are used frequently in engineering and scientific calculations, especially in image processing. Parallel implementations of FFT generally follow two approaches. One is the binary-exchange approach [1, 2], where data exchanges take place in all pairs of processors with processor numbers differing by one bit. Another one is the transpose approach [...

Publication date: 1999
